An incomplete bibliographical inquiry into what the ACM Digital Library has to say about legacy.
Two research questions come to mind immediately. Firstly, have legacy-related publications been on the rise? Secondly, what subtopics can be analyzed?
The first question could support analysing how knowledge gets captured in new concepts and practices, e.g. refactoring or SOA. The second question could support validation against qualitative methods.
In [3]:
import pandas as pd
import networkx as nx
import community
import itertools
import matplotlib.pyplot as plt
import numpy as np
import re
%matplotlib inline
Search for "legacy" in the ACM Digital Library. Just a simple search, for which the web interface gives 1541 results in mid-November 2016. The total number of items in the library is ~460000.
A CSV was downloaded from ACM DL, in the default sorting order of the library's own notion of relevance, whatever that means for them. BibTeX is also available.
In [4]:
legacybib = pd.read_csv("ACMDL201612108240806.csv")
The available data columns are
In [9]:
legacybib.columns
Out[9]:
A peek at the topmost data items.
In [10]:
legacybib.head(3)
Out[10]:
Does the id field uniquely identify items on the search list? If so, using it as index could be a good idea.
In [11]:
legacybib.id.duplicated().sum()  # 0 would mean the id is unique on this list
In [5]:
legacybib[legacybib.id.duplicated(keep=False)].head(2 * 2)
Out[5]:
Ok, so the id is not unique; who knows why some items appear twice in the downloaded list. The id can still be used for deduplication.
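If the duplicates were a problem, dropping them by id would be one way. A minimal sketch with a made-up stand-in frame (the real data comes from the downloaded CSV):

```python
import pandas as pd

# Stand-in frame with one duplicated id, mimicking the downloaded list
df = pd.DataFrame({"id": [1, 2, 2, 3],
                   "title": ["a", "b", "b", "c"]})
# drop_duplicates keeps the first occurrence of each id
deduped = df.drop_duplicates(subset="id")
print(len(deduped))  # 3
```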
What datatypes did Pandas infer from the CSV?
In [14]:
legacybib.dtypes
Out[14]:
Massage the keywords into lists. Note that ''.split(',') returns [''], hence the little if filter in there.
In [6]:
legacybib.keywords = legacybib.keywords.fillna('')
legacybib.keywords = legacybib.keywords.map(lambda l: [k.lower().strip() for k in l.split(',') if k])
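The reason for the `if k` filter can be checked directly:

```python
# Splitting an empty string on ',' still yields one (empty) element;
# the `if k` filter drops such empty strings.
print(''.split(','))                    # ['']
print([k for k in ''.split(',') if k])  # []
```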
Are any items missing the year?
In [7]:
legacybib[legacybib.year.isnull()].year
Out[7]:
To contextualize the legacy search results, get the number of total publications in ACM per year.
These were semi-manually extracted from the DOM of the ACM DL search results listing, with the following JavaScript:
acmYearly = {};
theChartData.labels.forEach(function(y, i) { acmYearly[y] = theChartData.datasets[0].data[i]; });
console.log(acmYearly);
In [8]:
acmPerYearData = { 1951: 43, 1952: 77, 1953: 34, 1954: 71, 1955: 72, 1956: 162, 1957: 144, 1958: 234, 1959: 335,
1960: 302, 1961: 521, 1962: 519, 1963: 451, 1964: 537, 1965: 561, 1966: 633, 1967: 754, 1968: 669, 1969: 907,
1970: 800, 1971: 1103, 1972: 1304, 1973: 1704, 1974: 1698, 1975: 1707, 1976: 2086, 1977: 1943, 1978: 2235, 1979: 1687,
1980: 2152, 1981: 2241, 1982: 2578, 1983: 2485, 1984: 2531, 1985: 2608, 1986: 3143, 1987: 3059, 1988: 3827, 1989: 4155,
1990: 4313, 1991: 4551, 1992: 5019, 1993: 5107, 1994: 5939, 1995: 6179, 1996: 6858, 1997: 7181, 1998: 8003, 1999: 7628,
2000: 9348, 2001: 8691, 2002: 10965, 2003: 11624, 2004: 14493, 2005: 16715, 2006: 19222, 2007: 19865, 2008: 21631, 2009: 23827,
2010: 27039, 2011: 25985, 2012: 27737, 2013: 25832, 2014: 26928, 2015: 27131, 2016: 25557, 2017: 39}
acmPerYear = pd.Series(acmPerYearData)
Let's check what percentage of the 1541 search results reported by the website actually made it into the download. It would be great if this was 100%.
In [18]:
round(len(legacybib) / 1541 * 100, 2)
Out[18]:
With the above peek at the ID field, how many unique items did we receive in the download?
In [19]:
len(legacybib.id.unique())
Out[19]:
Ok capped at 1000 I guess, which brings the percentage of the website search results available to us down to
In [20]:
round(len(legacybib.id.unique()) / 1541 * 100, 2)
Out[20]:
Distribution over the years of the items ACM identifies as relevant to the legacy search.
In [21]:
legacybib.year.hist(bins=int(legacybib.year.max() - legacybib.year.min()), figsize=(10, 2))
Out[21]:
What about the ACM Digital Library as a whole; what does its profile look like over time?
In [22]:
acmPerYear.plot(figsize=(10, 2))
Out[22]:
Similar overall shape, which isn't a surprise. Overlay the two, with the totals scaled by an arbitrary factor of 0.003.
In [23]:
plt.plot(legacybib.year.groupby(legacybib.year).count(), label='legacy publications')
plt.plot(acmPerYear * 0.003, label="total publications * 0.003")
plt.legend(loc='best')
Out[23]:
Right, so they do have somewhat similar shapes. Legacy as a concept lagged behind the overall ACM DL until it caught up through faster growth during the 1990s.
What about the ratio of this subset to the whole ACM DL? Has it increased or decreased over time? I.e., has the proportion of publications about legacy changed?
In [24]:
plt.plot(pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear), 'o')
Out[24]:
All the pre-1990 publications are:
In [25]:
legacybib[legacybib.year <= 1990][["year", "title"]].sort_values("year")
Out[25]:
And there are over 1000 publications after 1990, up to 2016. The first 10 of these are
In [26]:
legacybib[legacybib.year > 1990][["year", "title"]].sort_values("year").head(10)
Out[26]:
Did something happen around 1990, when the fraction of legacy-related publications started increasing? Let's fit a global linear regression model, as well as separate linear regression models before and after 1990.
In [27]:
# Fraction of legacy-related publications per year
ratio = pd.Series(legacybib.groupby(legacybib.year).year.count() / acmPerYear)
pre1990range = np.arange(legacybib.year.min(), 1991)
post1990range = np.arange(1990, legacybib.year.max())
# Linear regression models, fitted with np.polyfit (degree 1)
propLm = np.polyfit(ratio.dropna().index, ratio.dropna(), 1)
pre1990 = np.polyfit(ratio[pre1990range].dropna().index, ratio[pre1990range].dropna(), 1)
post1990 = np.polyfit(ratio[post1990range].dropna().index, ratio[post1990range].dropna(), 1)
# Plot the fractions of legacy vs. all publications, the models, and a legend
yearrange = np.arange(legacybib.year.min(), legacybib.year.max())
plt.plot(ratio, 'o')
plt.plot(yearrange, np.poly1d(propLm)(yearrange), label="global lm")
plt.plot(pre1990range, np.poly1d(pre1990)(pre1990range), linestyle="dashed", label="pre 1990 lm")
plt.plot(post1990range, np.poly1d(post1990)(post1990range), linestyle="dashed", label="post 1990 lm")
plt.title("Fraction of legacy related publications against ACM")
plt.legend(loc="best")
Out[27]:
Statistical validation of the above would of course be good, to check the apparent trend against randomness.
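One simple such check is a permutation test on the fitted slope. A sketch on hypothetical stand-in data (the real series would be the yearly legacy/total fractions computed above):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the yearly legacy/total ratio series
years = np.arange(1991, 2017)
frac = 0.0005 * (years - 1990) + rng.normal(0.0, 0.001, len(years))
observed_slope = np.polyfit(years, frac, 1)[0]
# Permutation test: how often does shuffling the ratios across years
# produce a slope at least as extreme as the observed one?
null_slopes = [np.polyfit(years, rng.permutation(frac), 1)[0]
               for _ in range(2000)]
p = np.mean(np.abs(null_slopes) >= abs(observed_slope))
print(observed_slope > 0, p)
```

A small p here would indicate that the increasing trend is unlikely to arise from a random shuffling of the same yearly values.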
The keywords are interesting. All keywords in this dataset are already related to legacy one way or another, since the data under inspection here is a subset of the total ACM Digital Library.
Keywords of course live a life of their own, and their number presumably keeps growing forever.
Which keywords are popular?
In [10]:
# a Counter is a dict subclass made for exactly this kind of histogram
from collections import Counter
keywordhist = Counter()
for kws in legacybib.keywords:
    keywordhist.update(kws)
How many keywords does each item have?
In [11]:
legacybib.keywords.map(lambda kws: len(kws)).describe()
Out[11]:
In [30]:
plt.title("Histogram of numbers of keywords per item")
plt.hist(legacybib.keywords.map(lambda kws: len(kws)), bins=max(legacybib.keywords.map(lambda kws: len(kws))) - 1)
Out[30]:
Ok, almost 400 items have no keywords at all. There are also some outliers; let's inspect the ones with more than 15 keywords. That sounds excessive...
In [31]:
legacybib[legacybib.keywords.map(lambda kws: len(kws)) > 15][["id", "title", "author", "keywords"]]
Out[31]:
And the keyword lists for the above
In [32]:
[keywordlist for keywordlist in legacybib[legacybib.keywords.map(lambda kws: len(kws)) > 15].keywords]
Out[32]:
That is excessive, but seems legit to me.
Total number of unique keywords:
In [33]:
len(keywordhist)
Out[33]:
Of these, the ones that occur in 10 or more items of the subset:
In [34]:
[(k, keywordhist[k]) for k in sorted(keywordhist, key=keywordhist.get, reverse=True) if keywordhist[k] >= 10]
Out[34]:
and further, those that occur in 3–9 items
In [35]:
[(k, keywordhist[k]) for k in sorted(keywordhist, key=keywordhist.get, reverse=True) if keywordhist[k] < 10 and keywordhist[k] >= 3]
Out[35]:
Of the remainder, the number of keywords which appear in only two items
In [36]:
len([k for k in keywordhist if keywordhist[k] == 2])
Out[36]:
and in only one item
In [37]:
len([k for k in keywordhist if keywordhist[k] == 1])
Out[37]:
The keywords that start with the word legacy, sorted by frequency:
In [12]:
sorted([(k, keywordhist[k]) for k in keywordhist if re.match("legacy", k)], key=lambda k: k[1], reverse=True)
Out[12]:
Next, build a keyword co-occurrence graph: two keywords are connected if they appear on the same item.
In [13]:
keywordg = nx.Graph()
# the graph is undirected, so unordered pairs (combinations) suffice
legacybib.keywords.map(lambda item: keywordg.add_edges_from(itertools.combinations(item, 2)), na_action='ignore')
print("Number of components:", len(list(nx.connected_components(keywordg))))
print("Largest ten component sizes:", sorted((len(comp) for comp in nx.connected_components(keywordg)), reverse=True)[:10])
So there is one dominant component, and 150 small ones. It's best to explore them interactively with Gephi.
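For illustration, the same construction on a few made-up keyword lists:

```python
import itertools
import networkx as nx

# Made-up keyword lists, to illustrate the co-occurrence graph construction
items = [["legacy systems", "migration"],
         ["migration", "cobol"],
         ["gui"]]
g = nx.Graph()
for kws in items:
    # every pair of keywords appearing on the same item becomes an edge
    g.add_edges_from(itertools.combinations(kws, 2))
# a lone keyword produces no pairs, so "gui" never enters the graph
print(sorted(g.nodes()))
print(nx.number_connected_components(g))  # 1
```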
In [40]:
nx.write_gexf(keywordg, "keywordg.gexf")
Degree distribution of the keyword graph, i.e. are there a few hub nodes with huge degree and then a large number of nodes with few connections, as in a power-law network? Additionally, let's see where the keywords with the word legacy in them are placed, by indicating them with green vertical lines. In the left diagram below, hubs are towards the right.
In [14]:
fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(10, 2)
ax1.set_title("Keyword degree histogram")
ax1.plot(nx.degree_histogram(keywordg))
ax1.vlines([keywordg.degree(l) for l in keywordg if re.match('legacy', l)], ax1.get_ylim()[0], ax1.get_ylim()[1], colors='green')
ax2.set_title("Keyword degree diagram, log/log")
ax2.loglog(nx.degree_histogram(keywordg))
Out[14]:
Eyeballing the above, most of the legacy keywords sit where the mass of the distribution is, i.e. at low degrees. One of the legacy nodes is a top hub, and some sit in the mid-ranges.
The top 3 keywords with the highest degree, i.e. towards the right in the graph above, are:
In [42]:
keywordgDegrees = pd.Series(keywordg.degree()).sort_values(ascending=False)
keywordgDegrees.head(3)
Out[42]:
Let's plot the top hub out.
In [43]:
def plotNeighborhood(graph, ego, color="green", includeEgo=False):
    """
    Plot the neighbourhood of a keyword in graph, after possibly removing the ego.

    graph : networkx.Graph-like graph
        The graph to get the neighbourhood from
    ego : node in graph
        The node whose neighbourhood to plot
    color : string
        Name of the color to use for plotting
    includeEgo : bool
        Include the ego node

    The function defaults to removing the ego node, because by definition
    it is connected to each of the nodes in the subgraph. With the ego
    removed, the result basically tells how the neighbours are connected
    with one another.
    """
    from math import sqrt
    plt.rcParams["figure.figsize"] = (10, 10)
    if includeEgo:
        subgraph = graph.subgraph(graph.neighbors(ego) + [ego])
    else:
        subgraph = graph.subgraph(graph.neighbors(ego))
    plt.title("Neighbourhood of " + ego + " (" + str(len(subgraph)) + ")")
    plt.axis('off')
    pos = nx.spring_layout(subgraph, k=1 / sqrt(len(subgraph) * 2))
    nx.draw_networkx(subgraph,
                     pos=pos,
                     font_size=9,
                     node_color=color,
                     alpha=0.8,
                     edge_color="light" + color)
    plt.show()
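Note that networkx also ships an ego-graph helper. A minimal sketch of the two variants on a made-up toy graph:

```python
import networkx as nx

# Toy graph: a triangle a-b-c plus a pendant node d hanging off c
g = nx.Graph([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")])
ego = nx.ego_graph(g, "b")            # "b" together with its neighbours
neigh = g.subgraph(g.neighbors("b"))  # neighbours only, ego removed
print(sorted(ego.nodes()), sorted(neigh.nodes()))
```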
In [45]:
plotNeighborhood(keywordg, "legacy systems")
In [48]:
plotNeighborhood(keywordg, "legacy software")
Community detection with the Louvain algorithm, explained in Blondel, Guillaume, Lambiotte, Lefebvre: Fast unfolding of communities in large networks (2008). For weighted networks, the modularity of a partition is $Q = \frac{1}{2m}\sum_{i, j} \Big[A_{ij} - \frac{k_i k_j}{2m}\Big] \delta(c_i, c_j)$, where $A_{ij}$ is the edge weight between nodes $i$ and $j$, $k_i = \sum_j A_{ij}$ is the (weighted) degree of $i$, $c_i$ is the community of $i$, the $\delta$-function is 1 if $c_i = c_j$ and 0 otherwise, and $m = \frac{1}{2}\sum_{i,j}A_{ij}$ is the total edge weight.
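The formula can be evaluated directly. A minimal sketch on a made-up graph (two triangles joined by a single edge) with a hand-picked partition; the graph and partition here are purely illustrative:

```python
import numpy as np

# Made-up graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0            # weight matrix A_ij
k = A.sum(axis=1)                       # (weighted) degrees k_i
m = A.sum() / 2                         # total edge weight m
c = np.array([0, 0, 0, 1, 1, 1])        # hand-picked partition: one community per triangle
delta = (c[:, None] == c[None, :])      # delta(c_i, c_j)
Q = ((A - np.outer(k, k) / (2 * m)) * delta).sum() / (2 * m)
print(round(Q, 3))  # 0.357, i.e. 5/14
```

A partition that cuts a triangle in half would score lower, which is what the Louvain algorithm exploits when it greedily moves nodes between communities.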
In [46]:
def plotCommunities(graph):
    """Plot community information from a graph.

    Basically just copied from http://perso.crans.org/aynaud/communities/index.html
    at this point, while in development.
    """
    # zoom in on something, for dev. purposes
    graph = graph.subgraph(graph.neighbors('legacy software'))
    # graph = [c for c in nx.connected_component_subgraphs(graph)][0]
    graph = max(nx.connected_component_subgraphs(graph), key=len)  # I love you Python
    partition = community.best_partition(graph)
    size = float(len(set(partition.values())))
    pos = nx.spring_layout(graph)
    plt.axis('off')
    count = 0
    for com in set(partition.values()):
        count = count + 1
        list_nodes = [node for node in partition.keys() if partition[node] == com]
        # grayscale color string, one shade per community
        nx.draw_networkx_nodes(graph, pos, list_nodes, node_size=40,
                               node_color=str(count / size), alpha=0.4)
    nx.draw_networkx_labels(graph, pos, font_size=9)
    nx.draw_networkx_edges(graph, pos, alpha=0.1)
    plt.show()
In [45]:
plotCommunities(keywordg)